Abstract. We study the problem of efficiently clustering protein sequences
in a limited information setting. We assume that we do not know the distances
between the sequences in advance, and must query them during the execution of
the algorithm. Our goal is to find an accurate clustering using few queries.
We model the problem as a point set $S$ with an unknown metric $d$ on $S$, and
assume that we have access to one versus all distance queries that, given a
point $s \in S$, return the distances between $s$ and all other points. Our one
versus all query represents an efficient sequence database search program such
as BLAST, which compares an input sequence to an entire data set. Given a
natural assumption about the approximation stability of the min-sum objective
function for clustering, we design a provably accurate clustering algorithm
that uses few one versus all queries. In our empirical study we show that our
method compares favorably to well-established clustering algorithms when we
compare computationally derived clusterings to gold-standard manual
classifications.



1 Introduction 

Biology is an information-driven science, and the size of available data
continues to expand at a remarkable rate. The growth of biological sequence
databases has been particularly impressive. For example, the size of GenBank,
a biological sequence repository, has doubled every 18 months from 1982 to
2007. It has become important to develop computational techniques that can
handle such large amounts of data. Clustering is very useful for exploring
relationships between protein sequences. However, most clustering algorithms
require distances between all pairs of points as input, which is infeasible to
obtain for very large protein sequence data sets. Even with a one versus all
distance query such as BLAST (Basic Local Alignment Search Tool) [AGM+90],
which efficiently compares a sequence to an entire database of sequences, it
may not be possible to use it $n$ times to construct the entire pairwise
distance matrix, where $n$ is the size of the data set. In this work we present
a clustering algorithm that gives an accurate clustering using only
$O(k \log k)$ queries, where $k$ is the number of clusters.

We analyze the correctness of our algorithm under a natural assumption about
the data, namely the $(c, \epsilon)$ approximation stability property of
[BBG09]. Balcan et al. assume that there is some relevant "target" clustering
$C_T$, and optimizing a particular objective function for clustering (such as
min-sum) gives clusterings that are structurally close to $C_T$. More
precisely, they assume that any $c$-approximation of the objective is
$\epsilon$-close to $C_T$, where the distance between two clusterings is the
fraction of misclassified points under the optimum matching between the two
sets of clusters. Our contribution is designing an algorithm that, given the
$(c, \epsilon)$-property for the min-sum objective, produces an accurate
clustering using only $O(k \log k)$ one versus all distance queries, and has a
runtime of $O(k \log(k)\, n \log(n))$. We conduct an empirical study that
compares computationally derived clusterings to those given by gold-standard
classifications of protein evolutionary relatedness. We show that our method
compares favorably to well-established clustering algorithms in terms of
accuracy. Moreover, our algorithm easily scales to massive data sets that
cannot be handled by traditional algorithms.

The algorithm presented here is related to the one presented in [VBR+10].
The Landmark-Clustering algorithm presented there gives an accurate clustering
if the instance satisfies the $(c, \epsilon)$-property for the $k$-median
objective. However, if the property is satisfied for the min-sum objective the
structure of the clustering instance is quite different, and the algorithm
given in [VBR+10] fails to find an accurate clustering in such cases. Indeed,
the analysis presented here is also quite different. The min-sum objective is
also considerably harder to approximate. For $k$-median the best approximation
guarantee is $(3 + \epsilon)$, given by [AGK+04]. For the min-sum objective
when the number of clusters is arbitrary there is an
$O(\delta^{-1} \log^{1+\delta} n)$-approximation algorithm with running time
$n^{O(1/\delta)}$ due to [BCR01].

There are also several other clustering algorithms that are applicable in our
limited information setting [AV07, AJM09, MOP01, CS07]. However, because all
of these methods seek to approximate an objective function, they will not
produce an accurate clustering in our model if the $(c, \epsilon)$-property
holds for values of $c$ for which finding a $c$-approximation is difficult.
Other than [VBR+10] we are not aware of any results providing both provably
accurate algorithms and strong query complexity guarantees in such a model.



2 Preliminaries 



Given a metric space $M = (X, d)$ with point set $X$, an unknown distance
function $d$ satisfying the triangle inequality, and a set of points
$S \subseteq X$, we would like to find a $k$-clustering $C$ that partitions
the points in $S$ into $k$ sets $C_1, \ldots, C_k$ by using one versus all
distance queries.

The min-sum objective function for clustering is to minimize
\[ \Phi(C) = \sum_{i=1}^{k} \sum_{x, y \in C_i} d(x, y). \]
The min-sum clustering problem can be reduced to the related balanced
$k$-median problem. The balanced $k$-median objective function seeks to
minimize
\[ \Psi(C) = \sum_{i=1}^{k} |C_i| \sum_{x \in C_i} d(x, c_i), \]
where $c_i$ is the median of cluster $C_i$, which is the point $y \in C_i$
that minimizes $\sum_{x \in C_i} d(x, y)$. As pointed out in [BCR01], in
metric spaces the two objective functions are related to within a factor of 2:
$\Psi(C)/2 \le \Phi(C) \le \Psi(C)$. For any objective function $\Omega$ we
use $\mathrm{OPT}_\Omega$ to denote its optimum value.
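
The two objectives can be compared on a toy instance. The following is a
minimal sketch, not the authors' code: the function names and the example data
are hypothetical, and it assumes that the min-sum cost counts each unordered
pair of points once, which is the convention under which the factor-2 relation
stated above holds.

```python
from itertools import combinations

def min_sum_cost(clusters, d):
    """Min-sum cost: sum of d(x, y) over unordered pairs inside each cluster
    (assumed convention, see the lead-in)."""
    return sum(d(x, y) for c in clusters for x, y in combinations(c, 2))

def median(cluster, d):
    """Point of the cluster minimizing the sum of distances to its members."""
    return min(cluster, key=lambda y: sum(d(x, y) for x in cluster))

def balanced_k_median_cost(clusters, d):
    """Balanced k-median cost: distance-to-median sums weighted by cluster size."""
    total = 0.0
    for c in clusters:
        m = median(c, d)
        total += len(c) * sum(d(x, m) for x in c)
    return total

# Hypothetical toy instance: points on a line with the absolute-difference metric.
d = lambda x, y: abs(x - y)
clusters = [[0.0, 1.0, 2.0], [10.0, 10.5]]
phi = min_sum_cost(clusters, d)
psi = balanced_k_median_cost(clusters, d)
print(phi, psi)
```

On this instance the min-sum cost lies between half the balanced $k$-median
cost and the balanced $k$-median cost itself, as the relation requires.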

In our analysis we assume that $S$ satisfies the $(c, \epsilon)$-property of
[BBG09] for the min-sum and balanced $k$-median objective functions. To
formalize the $(c, \epsilon)$-property we need to define a notion of distance
between two $k$-clusterings $C = \{C_1, \ldots, C_k\}$ and
$C' = \{C'_1, \ldots, C'_k\}$. As in [BBG09], we define the distance between
$C$ and $C'$ as the fraction of points on which they disagree under the
optimal matching of clusters in $C$ to clusters in $C'$:

\[ \mathrm{dist}(C, C') = \min_{\sigma \in S_k} \frac{1}{n} \sum_{i=1}^{k} |C_i - C'_{\sigma(i)}|, \]

where $S_k$ is the set of bijections
$\sigma: \{1, \ldots, k\} \to \{1, \ldots, k\}$. Two clusterings $C$ and $C'$
are said to be $\epsilon$-close if $\mathrm{dist}(C, C') < \epsilon$.
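
For small $k$ this distance can be computed directly from the definition. The
sketch below is illustrative only: `clustering_distance` is a hypothetical
name, and the brute-force search over all $k!$ bijections is chosen for
clarity; for larger $k$ the same minimum could be found with an assignment
(Hungarian) solver.

```python
from itertools import permutations

def clustering_distance(c1, c2, n):
    """Fraction of the n points misclassified under the best cluster matching.

    c1 and c2 are lists of k sets over the same n points. All k! bijections
    are tried, so this is only practical for small k.
    """
    k = len(c1)
    if len(c2) != k:
        raise ValueError("both clusterings must have k clusters")
    best = min(
        sum(len(c1[i] - c2[sigma[i]]) for i in range(k))
        for sigma in permutations(range(k))
    )
    return best / n

a = [{1, 2, 3}, {4, 5}, {6}]
b = [{4, 5, 6}, {1, 2}, {3}]
print(clustering_distance(a, b, 6))
```

Here the best matching leaves two of the six points (3 and 6) misclassified.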

We assume that there exists some unknown relevant "target" clustering $C_T$,
and given a proposed clustering $C$ we define the error of $C$ with respect to
$C_T$ as $\mathrm{dist}(C, C_T)$. Our goal is to find a clustering of low
error. The $(c, \epsilon)$ approximation stability property is defined as
follows.

Definition 1. We say that the instance $(S, d)$ satisfies the
$(c, \epsilon)$-property for objective function $\Omega$ with respect to the
target clustering $C_T$ if any clustering of $S$ that approximates
$\mathrm{OPT}_\Omega$ within a factor of $c$ is $\epsilon$-close to $C_T$,
that is,
$\Omega(C) \le c \cdot \mathrm{OPT}_\Omega \Rightarrow \mathrm{dist}(C, C_T) < \epsilon$.

We note that because any $(1 + \alpha)$-approximation of the balanced
$k$-median objective is a $2(1 + \alpha)$-approximation of the min-sum
objective, it follows that if the clustering instance satisfies the
$(2(1 + \alpha), \epsilon)$-property for the min-sum objective, then it
satisfies the $(1 + \alpha, \epsilon)$-property for balanced $k$-median.
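
The first claim follows from the factor-2 relation between the two
objectives. As a short worked derivation, write $\Phi$ for the min-sum
objective and $\Psi$ for the balanced $k$-median objective, and let
$C^*_\Psi$ and $C^*_\Phi$ denote optimal clusterings for each (this notation
is introduced here only for the derivation). For any clustering $C$ that is a
$(1 + \alpha)$-approximation for $\Psi$:

```latex
\begin{align*}
\Phi(C) &\le \Psi(C)
          && \text{(since $\Phi \le \Psi$ for every clustering)} \\
        &\le (1+\alpha)\,\Psi(C^*_\Psi)
          && \text{($C$ is a $(1+\alpha)$-approximation for $\Psi$)} \\
        &\le (1+\alpha)\,\Psi(C^*_\Phi)
          && \text{($C^*_\Psi$ minimizes $\Psi$)} \\
        &\le 2(1+\alpha)\,\Phi(C^*_\Phi)
          && \text{(since $\Psi/2 \le \Phi$)} \\
        &= 2(1+\alpha)\,\mathrm{OPT}_\Phi .
\end{align*}
```

So $C$ is covered by the $(2(1 + \alpha), \epsilon)$-property for min-sum and
is therefore $\epsilon$-close to the target clustering.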

3 Algorithm Overview 

In this section we present a clustering algorithm that, given the
$(1 + \alpha, \epsilon)$-property for the balanced $k$-median objective, finds
an accurate clustering using few distance queries. Our algorithm is outlined
in Algorithm 1 (with some implementation details omitted). We start by
choosing uniformly at random $n'$ points that we call landmarks, where $n'$ is
an appropriate number. For each landmark that we choose we use a one versus
all query to get the distances between this landmark and all other points.
These are the only distances used by our procedure.
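
The query model can be mimicked in a few lines. This is a hedged sketch: the
names `one_versus_all` and `query_landmarks` are hypothetical, and the distance
function `d` stands in for a database search such as BLAST, which the sketch
does not call.

```python
import random

def one_versus_all(s, points, d):
    """Stand-in for a database search: distances from s to every point."""
    return {p: d(s, p) for p in points}

def query_landmarks(points, n_landmarks, d, seed=0):
    """Pick landmarks uniformly at random and issue one query per landmark."""
    rng = random.Random(seed)
    landmarks = rng.sample(points, n_landmarks)
    return {l: one_versus_all(l, points, d) for l in landmarks}

# Hypothetical toy data: points on a line, absolute difference as the metric.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist_to = query_landmarks(points, 3, lambda x, y: abs(x - y))
print(sorted(dist_to))
```

Only the rows of the distance matrix that belong to landmarks are ever
materialized, which is what keeps the number of queries at $n'$.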



Our algorithm then expands a ball $B_l$ around each landmark $l$ one point
at a time. In each iteration we check whether some ball $B_{l^*}$ passes the
test in line 7. Our test considers the size of the ball and its radius, and
checks whether their product is greater than the threshold $T$. If this is the
case, we consider all balls that overlap $B_{l^*}$ on any points, and compute
a cluster that contains all the points in these balls. Points and landmarks in
the cluster are then removed from further consideration.

Algorithm 1 Landmark-Clustering-Min-Sum($S$, $k$, $n'$, $T$)

1: choose a set of landmarks $L$ of size $n'$ uniformly at random from $S$;
2: $i = 1$, $r = 0$;
3: while $i \le k$ do
4:   for each $l \in L$ do
5:     $B_l = \{s \in S \mid d(s, l) \le r\}$;
6:   end for
7:   if $\exists l^* \in L : |B_{l^*}| \cdot r > T$ then
8:     $L' = \{l \in L : B_l \cap B_{l^*} \ne \emptyset\}$;
9:     $C_i = \{s \in S : s \in B_l \text{ and } l \in L'\}$;
10:    $i = i + 1$;
11:    remove clustered points from consideration;
12:  end if
13:  increment $r$ to the next relevant distance;
14: end while
15: return $C = \{C_1, \ldots, C_k\}$;
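
The listing above can be turned into a direct, unoptimized Python sketch. It
is an illustration under stated assumptions rather than the authors'
implementation: the function name is hypothetical, landmarks are passed in
instead of being sampled, the passing ball with the most points is used as
$l^*$ when several pass, and every ball is rebuilt at each radius for clarity.

```python
def landmark_clustering_simple(points, landmarks, d, k, T):
    """Grow balls around landmarks; cut a cluster when |B| * r exceeds T."""
    active = set(points)                      # points not yet clustered
    live = list(landmarks)                    # landmarks not yet clustered
    radii = sorted({d(l, s) for l in landmarks for s in points})
    clusters = []
    for r in radii:                           # "next relevant distance"
        if len(clusters) == k or not live:
            break
        balls = {l: {s for s in active if d(s, l) <= r} for l in live}
        passing = [l for l in live if len(balls[l]) * r > T]
        if not passing:
            continue
        l_star = max(passing, key=lambda l: len(balls[l]))
        merged = [l for l in live if balls[l] & balls[l_star]]
        cluster = set().union(*(balls[l] for l in merged))
        clusters.append(cluster)
        active -= cluster
        live = [l for l in live if l not in cluster]
    return clusters

# Hypothetical toy data: a tight group and a looser group on a line.
pts = [0, 1, 2, 3, 100, 105, 110]
result = landmark_clustering_simple(pts, [1, 105], lambda x, y: abs(x - y), 2, 7)
print(result)
```

With the threshold 7 the tight group around landmark 1 is cut at radius 2,
before the ball around landmark 105 is large enough to pass the test.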



A complete description of this algorithm can be found in the next section.
We now present our theoretical guarantee for Algorithm 1.

Theorem 1. Given a metric space $M = (X, d)$, where $d$ is unknown, and a
set of points $S$, if the instance $(S, d)$ satisfies the
$(1 + \alpha, \epsilon)$-property for the balanced $k$-median objective
function, we are given the optimum objective value OPT, and each cluster in
the target clustering $C_T$ has size at least $(6 + 240/\alpha)\epsilon n$,
then Landmark-Clustering-Min-Sum($S$, $k$, $n'$, $\frac{\alpha w}{40\epsilon}$),
where $w = \mathrm{OPT}/n$, outputs a clustering that is
$O(\epsilon/\alpha)$-close to $C_T$ with probability at least $1 - \delta$.
The algorithm uses
$n' = \frac{1}{(3 + 120/\alpha)\epsilon} \ln \frac{k}{\delta}$ one versus all
distance queries, and has a runtime of $O(n' n \log n)$.

We note that $n' = O(k \ln \frac{k}{\delta})$ if the sizes of the target
clusters are balanced. In addition, if we do not know the value of OPT, we can
still find an accurate clustering by running Algorithm 1 from line 2 at most
$n' n^2$ times with increasing estimates of $T$ until enough points are
clustered. It is not necessary to recompute the landmarks, so the number of
distance queries that are required remains the same. We next give some
high-level intuition for how our procedures work.

Given our approximation stability assumption, the target clustering must
have the structure shown in Figure 1. Each target cluster $C_i$ has a "core"
of well-separated points, where any two points in the cluster core are closer
than a certain distance $d_i$ to each other, and any point in a different core
is farther than $c\,d_i$, for some constant $c$. Moreover, the diameters of
the cluster cores are inversely proportional to the cluster sizes: there is
some constant $\theta$ such that $|C_i| \cdot d_i = \theta$ for each cluster
$C_i$. Given this structure, it is possible to classify the points in the
cluster cores correctly if we extract the smaller diameter clusters first. In
the example in Figure 1 we can extract $C_1$, followed by $C_2$ and $C_3$, if
we choose the threshold $T$ correctly and we have selected a landmark from
each cluster core. However, if we wait until some ball contains all of $C_3$,
$C_1$ and $C_2$ may be merged.

Fig. 1. Cluster cores $C_1$, $C_2$ and $C_3$ are shown with diameters $d_1$,
$d_2$ and $d_3$, respectively. The diameters of the cluster cores are
inversely proportional to their sizes.

4 Algorithm Analysis 

In this section we present a formal analysis of our algorithm, and give the
proof of Theorem 1. We first present a complete description of the algorithm.
We then describe the structure of the clustering instance that is implied by
our approximation stability assumption. We then give a general overview of our
argument, which is followed by the complete proof.

4.1 Algorithm Description 

A full description of our algorithm is given in Algorithm 2. In order to
efficiently expand a ball around each landmark, we first sort all
landmark-point pairs $(l, s)$ by $d(l, s)$. We then consider these pairs in
order of increasing distance (line 7), skipping pairs where $l$ or $s$ have
already been clustered; the clustered points are maintained in the set
$\bar{S}$.

In each iteration we check whether some ball $B_{l^*}$ passes the test in
line 19. Our actual test, which is slightly different from the one presented
earlier, considers the size of the ball and the next largest landmark-point
distance (denoted by $r_2$), and checks whether their product is greater than
the threshold $T$. If this is the case, we consider all balls that overlap
$B_{l^*}$ on any points, and compute a cluster that contains all the points in
these balls. Points and landmarks in the cluster are then removed from further
consideration by adding the clustered points to $\bar{S}$, and removing the
clustered points from any ball.

Our procedure terminates once we find $k$ clusters. If we reach the final
landmark-point pair, we stop and report the remaining unclustered points as
part of the same cluster (line 12). If the algorithm terminates without
partitioning all the points, we assign each remaining point to the cluster
containing the closest clustered landmark. In our analysis we show that if the
clustering instance satisfies the $(1 + \alpha, \epsilon)$-property for the
balanced $k$-median objective function, our procedure will output exactly $k$
clusters.

The most time-consuming part of our algorithm is sorting all landmark-point
pairs, which takes $O(|L| n \log n)$, where $n$ is the size of the data set
and $L$ is the set of landmarks. With a simple implementation that uses a
hashed set to store the points in each ball, the total cost of computing the
clusters and removing clustered points from active balls is at most
$O(|L| n)$ each. All other operations take asymptotically less time, so the
overall runtime of our procedure is $O(|L| n \log n)$.



Algorithm 2 Landmark-Clustering-Min-Sum($S$, $k$, $n'$, $T$)

1: choose a set of landmarks $L$ of size $n'$ uniformly at random from $S$;
2: for each $l \in L$ do
3:   $B_l = \emptyset$;
4: end for
5: $i = 1$, $\bar{S} = \emptyset$;
6: while $i \le k$ do
7:   $(l, s)$ = GetNextActivePair();
8:   $r_1 = d(l, s)$;
9:   if ($(l', s')$ = PeekNextActivePair()) != null then
10:    $r_2 = d(l', s')$;
11:  else
12:    $C_i = S - \bar{S}$;
13:    break;
14:  end if
15:  $B_l = B_l + \{s\}$;
16:  if $r_1 == r_2$ then
17:    continue;
18:  end if
19:  while $\exists l \in L - \bar{S} : |B_l| > T/r_2$ and $i \le k$ do
20:    $l^* = \mathrm{argmax}_{l \in L - \bar{S}} |B_l|$;
21:    $L' = \{l \in L - \bar{S} : B_l \cap B_{l^*} \ne \emptyset\}$;
22:    $C_i = \{s \in S : s \in B_l \text{ and } l \in L'\}$;
23:    for each $s \in C_i$ do
24:      $\bar{S} = \bar{S} + \{s\}$;
25:      for each $l \in L$ do
26:        $B_l = B_l - \{s\}$;
27:      end for
28:    end for
29:    $i = i + 1$;
30:  end while
31: end while
32: return $C = \{C_1, \ldots, C_k\}$;
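
A compact Python rendering of the listing may help when reading the proof.
This is a sketch under assumptions, not the authors' implementation: the
function name and helper `skip_inactive` are hypothetical, landmarks are passed
in rather than sampled, the test in line 19 is taken to be $|B_l| > T/r_2$ as
in the analysis, and the final step that assigns leftover points to the
nearest clustered landmark is omitted.

```python
def landmark_clustering_min_sum(points, landmarks, d, k, T):
    """Process landmark-point pairs in sorted order, cutting clusters on the fly."""
    pairs = sorted(((d(l, s), l, s) for l in landmarks for s in points),
                   key=lambda t: t[0])
    balls = {l: set() for l in landmarks}
    clustered = set()                          # plays the role of S-bar
    clusters = []

    def skip_inactive(pos):
        """Advance past pairs whose landmark or point is already clustered."""
        while pos < len(pairs) and (pairs[pos][1] in clustered
                                    or pairs[pos][2] in clustered):
            pos += 1
        return pos

    pos = skip_inactive(0)
    while len(clusters) < k and pos < len(pairs):
        r1, l, s = pairs[pos]                  # lines 7-8
        pos = skip_inactive(pos + 1)           # peek at the next active pair
        if pos == len(pairs):                  # line 12: no pair left
            clusters.append(set(points) - clustered)
            break
        r2 = pairs[pos][0]
        balls[l].add(s)                        # line 15
        if r1 == r2:                           # lines 16-18
            continue
        while len(clusters) < k:               # line 19
            live = [m for m in landmarks if m not in clustered]
            passing = [m for m in live if len(balls[m]) > T / r2]
            if not passing:
                break
            l_star = max(passing, key=lambda m: len(balls[m]))
            merged = [m for m in live if balls[m] & balls[l_star]]
            cluster = set().union(*(balls[m] for m in merged))
            clusters.append(cluster)
            clustered |= cluster               # lines 23-24
            for m in landmarks:                # lines 25-27
                balls[m] -= cluster
        pos = skip_inactive(pos)               # some pairs may now be inactive
    return clusters

pts = [0, 1, 2, 3, 100, 105, 110]
out = landmark_clustering_min_sum(pts, [1, 105], lambda x, y: abs(x - y), 2, 7)
print(out)
```

In this toy run the first cluster is cut by the test in line 19, and the second
is reported through line 12 once the active pairs are exhausted.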



4.2 Structure of the Clustering Instance 

We next describe the structure of the clustering instance that is implied by
our approximation stability assumption. We denote by
$C^* = \{C_1^*, \ldots, C_k^*\}$ the optimal balanced $k$-median clustering
with objective value $\mathrm{OPT} = \Psi(C^*)$. For each cluster $C_i^*$, let
$c_i^*$ be the median point in the cluster. For $x \in C_i^*$, define
$w(x) = |C_i^*|\, d(x, c_i^*)$ and let
$w = \mathrm{avg}_x\, w(x) = \frac{\mathrm{OPT}}{n}$. Define
$w_2(x) = \min_{j \ne i} |C_j^*|\, d(x, c_j^*)$.

It is proved in [BBG09] that if the instance satisfies the
$(1 + \alpha, \epsilon)$-property for the balanced $k$-median objective
function and each cluster in $C^*$ has size at least
$\max(6, 6/\alpha) \cdot \epsilon n$, then at most $2\epsilon$-fraction of
points $x \in S$ have $w_2(x) < \frac{\alpha w}{4\epsilon}$. In addition, by
definition of the average weight $w$ at most $120\epsilon/\alpha$-fraction of
points $x \in S$ have $w(x) > \frac{\alpha w}{120\epsilon}$.

We call point $x$ good if both $w(x) \le \frac{\alpha w}{120\epsilon}$ and
$w_2(x) \ge \frac{\alpha w}{4\epsilon}$, else $x$ is called bad. Let $X_i$ be
the good points in the optimal cluster $C_i^*$, and let
$B = S \setminus \bigcup X_i$ be the bad points.
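
The definition of good points can be checked mechanically on small inputs.
The sketch below uses hypothetical function names and toy data. It assumes the
thresholds $w(x) \le \alpha w/(120\epsilon)$ and
$w_2(x) \ge \alpha w/(4\epsilon)$, which is our reading of the constants used
in the proof of Lemma 1, and it treats the given clustering as if it were the
optimal balanced $k$-median clustering.

```python
def point_weights(clusters, d):
    """Return {x: (w(x), w2(x))} for a clustering given as lists of points."""
    medians = [min(c, key=lambda y: sum(d(x, y) for x in c)) for c in clusters]
    weights = {}
    for i, c in enumerate(clusters):
        for x in c:
            w = len(c) * d(x, medians[i])
            w2 = min(len(clusters[j]) * d(x, medians[j])
                     for j in range(len(clusters)) if j != i)
            weights[x] = (w, w2)
    return weights

def good_points(clusters, d, alpha, eps):
    """Points with small weight to their own median and large weight to others.

    The two thresholds are assumed constants, see the lead-in.
    """
    weights = point_weights(clusters, d)
    w_avg = sum(w for w, _ in weights.values()) / len(weights)
    return {x for x, (w, w2) in weights.items()
            if w <= alpha * w_avg / (120 * eps)
            and w2 >= alpha * w_avg / (4 * eps)}

clusters = [[0, 1, 2, 3], [100, 105, 110]]
good = good_points(clusters, lambda x, y: abs(x - y), alpha=1.0, eps=0.01)
print(sorted(good))
```

The parameter values are chosen only so that the toy example has both good
and bad points; they are not meant to satisfy the size requirements of the
analysis.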

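Lemma 1, which is similar to Lemma 14 of [BBG09], proves that the optimum
balanced $k$-median clustering must have the following structure:

1. For all $x, y$ in the same $X_i$, we have
   $d(x, y) \le \frac{\alpha w}{60\epsilon |C_i^*|}$.

2. For $x \in X_i$ and $y \in X_{j \ne i}$,
   $d(x, y) > \frac{\alpha w}{5\epsilon} / \min(|C_i^*|, |C_j^*|)$.

3. The number of bad points is at most $b = (2 + 120/\alpha)\epsilon n$.

4.3 Proof of Theorem 1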

Our algorithm expands a ball around each landmark, one point at a time, until
some ball is large enough. We use $r_1$ to refer to the current radius of the
balls, and $r_2$ to refer to the next relevant radius (next largest
landmark-point distance). To pass the test in line 19, a ball must satisfy
$|B_l| > T/r_2$. We choose $T$ such that by the time a ball satisfies the
conditional, it must overlap some good set $X_i$. Moreover, at this time the
radius must be large enough for $X_i$ to be entirely contained in some ball;
$X_i$ will therefore be part of the cluster computed in line 22. However, the
radius is too small for a single ball to overlap different good sets and for
two balls overlapping different good sets to share any points. Therefore the
computed cluster cannot contain points from any other good set. Points and
landmarks in the cluster are then removed from further consideration. The same
argument can then be applied again to show that each cluster output by the
algorithm entirely contains a single good set. Thus the clustering output by
the algorithm agrees with $C^*$ on all the good points, so it must be closer
than $b + \epsilon = O(\epsilon/\alpha)$ to $C_T$. A more detailed argument is
given below.

Proof. Since each cluster in the target clustering has more than
$(6 + 240/\alpha)\epsilon n$ points, and the optimal balanced $k$-median
clustering $C^*$ can differ from the target clustering by fewer than
$\epsilon n$ points, each cluster in $C^*$ must have more than
$(5 + 240/\alpha)\epsilon n$ points. Moreover, by Lemma 1 we may have at most
$(2 + 120/\alpha)\epsilon n$ bad points, and hence each
$|X_i| = |C_i^* \setminus B| > (3 + 120/\alpha)\epsilon n > (2 + 120/\alpha)\epsilon n + 2 = b + 2$.
We will use $s_{\min}$ to refer to the $(3 + 120/\alpha)\epsilon n$ quantity.




Our argument assumes that we have chosen at least one landmark from each
good set $X_i$. Lemma 2 argues that after selecting
$n' = \frac{n}{s_{\min}} \ln \frac{k}{\delta} = \frac{1}{(3 + 120/\alpha)\epsilon} \ln \frac{k}{\delta}$
landmarks the probability of this happening is at least $1 - \delta$.
Moreover, if the target clusters are balanced in size, that is
$\max_{C \in C_T} |C| / \min_{C \in C_T} |C| < c$ for some constant $c$, then
because the size of each good set is at least half the size of the
corresponding target cluster, it must be the case that
$2 s_{\min} c \cdot k \ge n$, so $n/s_{\min} = O(k)$.

Suppose that we order the clusters of $C^*$ such that
$|C_1^*| \ge |C_2^*| \ge \ldots \ge |C_k^*|$, and let $n_i = |C_i^*|$. Define
$d_i = \frac{\alpha w}{60\epsilon |C_i^*|}$ and recall that
$\max_{x, y \in X_i} d(x, y) \le d_i$. Note that because there is a landmark
in each good set $X_i$, for radius $r \ge d_i$ there exists some ball
containing all of $X_i$. We use $B_l(r)$ to denote a ball of radius $r$ around
landmark $l$: $B_l(r) = \{s \in S \mid d(s, l) \le r\}$.

If we apply Lemma 3 with all the clusters in $C^*$, we can see that as long
as $r \le 3d_1$, a ball cannot contain points from more than one good set and
balls overlapping different good sets cannot share any points. We also observe
that when both $r \le 3d_1$ and $r < d_i$ are true, a ball $B_l(r)$ containing
points from $X_i$ does not satisfy $|B_l(r)| \ge T/r$. For $r \le 3d_1$ a ball
cannot contain points from different good sets; therefore any ball containing
points from $X_i$ has size at most $|C_i^*| + b < \frac{3n_i}{2}$. In
addition, for $r < d_i$ the size bound
$T/r > T/d_i = \frac{\alpha w}{40\epsilon} / \frac{\alpha w}{60\epsilon |C_i^*|} = \frac{3n_i}{2}$.
Therefore for these values of $r$ any ball containing points from $X_i$ is too
small to satisfy the conditional.

Finally, we observe that for $r = 3d_1$ some ball $B_l(r)$ containing all of
$X_1$ does satisfy $|B_l(r)| \ge T/r$. Clearly, for $r = 3d_1$ there is some
ball containing all of $X_1$, which must have size at least
$|C_1^*| - b > n_1/2$. For $r = 3d_1$ the size bound $T/r = n_1/2$, so this
ball is large enough to satisfy this conditional. Moreover, for $r < 3d_1$ the
size bound $T/r > n_1/2$. Therefore a ball containing only bad points cannot
pass our test for $r \le 3d_1$ because the number of bad points is at most
$b < n_1/2$.
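
The two size bounds used above are simple arithmetic once the threshold and
the core diameters are fixed. As a worked check, assume
$T = \frac{\alpha w}{40\epsilon}$ and
$d_i = \frac{\alpha w}{60\epsilon n_i}$ (the values we take the analysis to
use; both are assumptions of this sketch):

```latex
\begin{align*}
\frac{T}{d_i}
  &= \frac{\alpha w}{40\epsilon} \cdot \frac{60\epsilon\, n_i}{\alpha w}
   = \frac{60}{40}\, n_i = \frac{3 n_i}{2}, \\
\frac{T}{3 d_1}
  &= \frac{1}{3} \cdot \frac{3 n_1}{2} = \frac{n_1}{2}.
\end{align*}
```

So below radius $d_i$ a ball would need more than $3n_i/2$ points to pass,
which a ball touching $X_i$ cannot have, while at radius $3d_1$ a ball needs
only about $n_1/2$ points, which any ball containing $X_1$ does have.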

Consider the smallest radius $r^*$ for which some ball $B_{l^*}(r^*)$
satisfies $|B_{l^*}(r^*)| \ge T/r^*$. It must be the case that
$r^* \le 3d_1$, and $B_{l^*}$ overlaps with some good set $X_i$ because we
cannot have a ball containing only bad points for $r^* \le 3d_1$. Moreover, by
our previous argument, because $B_{l^*}$ contains points from $X_i$, it must
be the case that $r^* \ge d_i$, and therefore some ball contains all the
points in $X_i$. Consider a cluster $C$ of all the points in balls that
overlap $B_{l^*}$:
$C = \{s \in S \mid s \in B_l \text{ and } B_l \cap B_{l^*} \ne \emptyset\}$,
which must include all the points in $X_i$. In addition, $B_{l^*}$ cannot
share any points with balls that overlap other good sets because
$r^* \le 3d_1$; therefore $C$ does not contain points from any other good set.
Therefore the cluster $C$ entirely contains some good set and no points from
any other good set.

These facts suggest the following conceptual algorithm for finding a
clustering that classifies all the good points correctly: increment $r$ until
some ball satisfies $|B_l(r)| \ge T/r$, compute the cluster containing all
points in balls that overlap $B_l(r)$, remove these points, and repeat until
we find $k$ clusters. We can argue that each cluster output by the algorithm
entirely contains some good set and no points from any other good set. Each
time we can consider the clusters $\bar{C} \subseteq C^*$ whose good sets have
not yet been output, order them by size, and apply Lemma 3 with $\bar{C}$ to
argue that while $r \le 3d_1$ the radius is too small for the computed cluster
to overlap any of the remaining good sets. As before, we can argue that by the
time we reach $3d_1$ we must output some cluster. In addition, when
$r \le 3d_1$ we cannot output a cluster of only bad points, and whenever we
output a cluster overlapping some good set $X_i$, it must be the case that
$r \ge d_i$. Therefore each computed cluster must entirely contain some good
set and no points from any other good set. If there are any unclustered points
upon the completion of the algorithm, we can assign the remaining points to
any cluster. Still, we are able to classify all the good points correctly, so
the reported clustering must be closer than
$b + \mathrm{dist}(C^*, C_T) < b + \epsilon = O(\epsilon/\alpha)$ to $C_T$.

It suffices to show that even though our algorithm only considers discrete
values of $r$ corresponding to landmark-point distances, the output of our
procedure exactly matches the output of the conceptual algorithm described
above. Consider the smallest (continuous) radius $r^*$ for which some ball
$B_{l_1}(r^*)$ satisfies $|B_{l_1}(r^*)| \ge T/r^*$. We use $d_{real}$ to
refer to the largest landmark-point distance such that $d_{real} \le r^*$.
Clearly, by the time our algorithm reaches $r_1 = d_{real}$ it must be the
case that $B_{l_1}$ passes the test on line 19: $|B_{l_1}| > T/r_2$, and this
test is not passed by any ball at any prior time. Moreover, $B_{l_1}$ must be
the largest ball passing our test at this point because if there is another
ball $B_{l_2}$ that also satisfies our test when $r_1 = d_{real}$ it must be
the case that $|B_{l_1}| > |B_{l_2}|$ because $B_{l_1}$ satisfies
$|B_l(r)| \ge T/r$ for a smaller $r$. Finally, because there are no
landmark-point pairs $(l, s)$ with $r_1 < d(l, s) < r_2$,
$B_l(r_1) = B_l(r^*)$ for each landmark $l \in L$. Therefore the cluster that
we compute on line 22 for $B_{l_1}(r_1)$ is equivalent to the cluster the
conceptual algorithm computes for $B_{l_1}(r^*)$. We can repeat this argument
for each cluster output by the conceptual algorithm, showing that Algorithm 2
finds exactly the same clustering.

We note that when there is only one good set left the test in line 19 may
not be satisfied anymore if $3d_1 > \max_{x, y \in S} d(x, y)$, where $d_1$ is
the diameter of the remaining good set. However, in this case if we exhaust
all landmark-point pairs we report the remaining points as part of a single
cluster (line 12), which must contain the remaining good set, and possibly
some additional bad points that we consider misclassified anyway.

With a simple implementation that uses a hashed set to keep track of the
points in each ball, the runtime of our procedure is $O(|L| n \log n)$, which
is given by the time necessary to sort all landmark-point pairs by distance.
All other operations take asymptotically less time. In particular, over the
entire run of the algorithm, the cost of computing the clusters in lines 21-22
is at most $O(n|L|)$, and the cost of removing clustered points from active
balls in lines 23-28 is also at most $O(n|L|)$. □

Theorem 2. If we are not given the optimum objective value $w$, then we can
still find a clustering that is $O(\epsilon/\alpha)$-close to $C_T$ with
probability at least $1 - \delta$ by running Landmark-Clustering-Min-Sum at
most $n' n^2$ times with the same set of landmarks, where the number of
landmarks $n' = \frac{1}{(3 + 120/\alpha)\epsilon} \ln \frac{k}{\delta}$ as
before.

Proof. If we are not given the value of $w$ then we have to estimate the
threshold parameter $T$ for deciding when a cluster develops. Let us use $T^*$
to refer to its correct value ($T^* = \frac{\alpha w}{40\epsilon}$). We first
note that there are at most $n \cdot n|L|$ relevant values of $T$ to try,
where $L$ is the set of landmarks. Our test in line 19 checks whether the
product of a ball size and a ball radius is larger than $T$, and there are
only $n$ possible ball sizes and $|L| n$ possible values of a ball radius.

Suppose that we choose a set of landmarks $L$, $|L| = n'$, as before. We then
compute all $n' n^2$ relevant values of $T$ and order them in ascending order:
$T_i \le T_{i+1}$ for $1 \le i < n' n^2$. Then we repeatedly execute
Algorithm 2 starting on line 2 with increasing estimates of $T$. Note that
this is equivalent to trying all continuous values of $T$ in ascending order
because the execution of the algorithm does not change for any $T'$ such that
$T_i \le T' < T_{i+1}$. In other words, when $T_i \le T' < T_{i+1}$, the
algorithm will give the same exact answer for $T_i$ as it would for $T'$.
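
The candidate thresholds can be enumerated from the landmark-point distances
alone. The following sketch uses a hypothetical function name and toy data; it
lists every product of a possible ball size and a possible ball radius, which
is the set of values at which the outcome of the test can change.

```python
def candidate_thresholds(points, landmarks, d):
    """All products (ball size) * (landmark-point distance), in ascending order."""
    radii = {d(l, s) for l in landmarks for s in points}
    sizes = range(1, len(points) + 1)
    return sorted({size * r for size in sizes for r in radii})

pts = [0, 1, 2, 3, 100, 105, 110]
lms = [1, 105]
ts = candidate_thresholds(pts, lms, lambda x, y: abs(x - y))
print(len(ts), ts[:5])
```

Duplicate products collapse, so the list is usually shorter than the
$n \cdot n|L|$ upper bound; a driver would then rerun the clustering procedure
on these values in increasing order, reusing the same landmark distances.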

Our procedure stops the first time we cluster at least $n - b$ points, where
$b$ is the maximum number of bad points. We give an argument that this gives
an accurate clustering with an additional error of $b$.

As before, we assume that we have selected at least one landmark from each
good set, which happens with probability at least $1 - \delta$. Clearly, if we
choose the right threshold $T^*$ the algorithm must cluster at least $n - b$
points because the clustering will contain all the good points. Therefore the
first time the algorithm clusters at least $n - b$ points for some estimated
threshold $T$, it must be the case that $T \le T^*$. Lemma 4 argues that if
$T \le T^*$ and the number of clustered points is at least $n - b$, then the
reported partition must be a $k$-clustering that contains a distinct good set
in each cluster. This clustering may exclude up to $b$ points, all of which
may be good points. Still, if we arbitrarily assign the remaining points we
will get a clustering that is closer than $2b + \epsilon = O(\epsilon/\alpha)$
to $C_T$. □

Lemma 1. If the balanced $k$-median instance satisfies the
$(1 + \alpha, \epsilon)$-property and each cluster in $C^*$ has size at least
$\max(6, 6/\alpha) \cdot \epsilon n$ we have:

1. For all $x, y$ in the same $X_i$, we have
   $d(x, y) \le \frac{\alpha w}{60\epsilon |C_i^*|}$.

2. For $x \in X_i$ and $y \in X_{j \ne i}$,
   $d(x, y) > \frac{\alpha w}{5\epsilon} / \min(|C_i^*|, |C_j^*|)$.

3. The number of bad points is at most $b = (2 + 120/\alpha)\epsilon n$.

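Proof. For part 1, since $x, y \in X_i \subseteq C_i^*$ are both good, they
are at distance of at most $\frac{\alpha w}{120\epsilon |C_i^*|}$ to $c_i^*$,
and hence at distance of at most $\frac{\alpha w}{60\epsilon |C_i^*|}$ to each
other.

For part 2 assume without loss of generality that $|C_i^*| \ge |C_j^*|$. Both
$x \in C_i^*$ and $y \in C_j^*$ are good; it follows that
$d(y, c_j^*) \le \frac{\alpha w}{120\epsilon |C_j^*|}$, and
$d(x, c_j^*) \ge \frac{\alpha w}{4\epsilon |C_j^*|}$ because
$|C_j^*|\, d(x, c_j^*) \ge w_2(x) \ge \frac{\alpha w}{4\epsilon}$. By the
triangle inequality it follows that

\[ d(x, y) \ge d(x, c_j^*) - d(y, c_j^*) \ge \frac{\alpha w}{\epsilon |C_j^*|} \left( \frac{1}{4} - \frac{1}{120} \right) > \frac{\alpha w}{5\epsilon} / \min(|C_i^*|, |C_j^*|), \]

where we use that $|C_j^*| = \min(|C_i^*|, |C_j^*|)$.

Part 3 follows from the maximum number of points that may not satisfy each of
the properties of the good points and the union bound. □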



Lemma 2. After selecting $\frac{n}{s} \ln \frac{k}{\delta}$ points uniformly
at random, where $s$ is the size of the smallest good set, the probability
that we did not choose a point from every good set is smaller than $\delta$.

Proof. We denote by $s_i$ the cardinality of $X_i$. Observe that the
probability of not selecting a point from some good set $X_i$ after
$\frac{cn}{s}$ samples is
$(1 - \frac{s_i}{n})^{\frac{cn}{s}} \le (1 - \frac{s}{n})^{\frac{cn}{s}} \le (e^{-\frac{s}{n}})^{\frac{cn}{s}} = e^{-c}$.
By the union bound the probability of not selecting a point from every good
set after $\frac{cn}{s}$ samples is at most $k e^{-c}$, which is equal to
$\delta$ for $c = \ln \frac{k}{\delta}$. □
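
The sample size and the resulting failure bound are easy to evaluate
numerically. This is a small sketch with hypothetical function names and
made-up parameter values; it only evaluates the union bound from the proof and
does not simulate the sampling.

```python
import math

def num_landmarks(n, s, k, delta):
    """Number of uniform samples (n/s) * ln(k/delta), rounded up."""
    return math.ceil((n / s) * math.log(k / delta))

def miss_probability_bound(n, s, k, m):
    """Union bound k * (1 - s/n)^m on missing some good set with m samples."""
    return k * (1 - s / n) ** m

n, s, k, delta = 1000, 100, 5, 0.05
m = num_landmarks(n, s, k, delta)
bound = miss_probability_bound(n, s, k, m)
print(m, bound)
```

For these values the bound stays below the chosen $\delta$, as the lemma
predicts; rounding the sample count up can only make the bound smaller.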

Lemma 3. Given a subset of clusters $\bar{C} \subseteq C^*$, and the set of
the corresponding good sets $\bar{X}$, let
$s_{\max} = \max_{C_i \in \bar{C}} |C_i|$ be the size of the largest cluster
in $\bar{C}$, and $d_{\min} = \frac{\alpha w}{60\epsilon s_{\max}}$. Then for
$r \le 3d_{\min}$, a ball cannot overlap a good set $X_i \in \bar{X}$ and any
other good set, and a ball containing points from a good set
$X_i \in \bar{X}$ cannot share any points with a ball containing points from
any other good set.

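Proof. By part 2 of Lemma 1, for $x \in X_i$ and $y \in X_{j \ne i}$ we have

\[ d(x, y) > \frac{\alpha w}{5\epsilon} / \min(|C_i^*|, |C_j^*|). \]

It follows that for $x \in X_i \in \bar{X}$ and $y \in X_{j \ne i}$ we must
have
$d(x, y) > \frac{\alpha w}{5\epsilon} / \min(|C_i^*|, |C_j^*|) \ge \frac{\alpha w}{5\epsilon} / |C_i^*| \ge \frac{\alpha w}{5\epsilon} / s_{\max} = 12 d_{\min}$,
where we use the fact that $|C_i^*| \le s_{\max}$. So a point in a good set in
$\bar{X}$ and a point in any other good set must be farther than
$12 d_{\min}$.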

To prove the first part, consider a ball $B_l$ of radius $r \le 3d_{\min}$
around landmark $l$. In other words,
$B_l = \{s \in S \mid d(s, l) \le r\}$. If $B_l$ overlaps a good set
$X_i \in \bar{X}$ and any other good set, then it must contain a point
$x \in X_i$ and a point $y \in X_{j \ne i}$. It follows that
$d(x, y) \le d(x, l) + d(l, y) \le 2r \le 6d_{\min}$, giving a contradiction.

To prove the second part, consider two balls $B_{l_1}$ and $B_{l_2}$ of
radius $r \le 3d_{\min}$ around landmarks $l_1$ and $l_2$. Suppose $B_{l_1}$
and $B_{l_2}$ share at least one point:
$B_{l_1} \cap B_{l_2} \ne \emptyset$, and use $s^*$ to refer to this point. It
follows that the distance between any point $x \in B_{l_1}$ and
$y \in B_{l_2}$ satisfies
$d(x, y) \le d(x, s^*) + d(s^*, y) \le [d(x, l_1) + d(l_1, s^*)] + [d(s^*, l_2) + d(l_2, y)] \le 4r \le 12 d_{\min}$.

If $B_{l_1}$ overlaps with $X_i \in \bar{X}$ and $B_{l_2}$ overlaps with
$X_{j \ne i}$, and the two balls share at least one point, there must be a
pair of points $x \in X_i$ and $y \in X_{j \ne i}$ such that
$d(x, y) \le 12 d_{\min}$, giving a contradiction. Therefore if $B_{l_1}$
overlaps with some good set $X_i \in \bar{X}$ and $B_{l_2}$ overlaps with any
other good set, $B_{l_1} \cap B_{l_2} = \emptyset$. □

Lemma 4. If $T \le T^* = \frac{\alpha w}{40\epsilon}$ and the number of
clustered points is at least $n - b$, then the clustering output by
Landmark-Clustering-Min-Sum using the threshold $T$ must be a $k$-clustering
that contains a distinct good set in each cluster.

Proof. Our argument considers the points that are in each cluster that is
output by the algorithm. Let us call a good set covered if any of the clusters
$C_1, \ldots, C_{i-1}$ found so far contain points from it. We will use
$\bar{C}^*$ to refer to the clusters in $C^*$ whose good sets are not covered.
It is critical to observe that if $T \le T^*$ then if $C_i$ contains points
from an uncovered good set, $C_i$ cannot overlap with any other good set.

To see this, let us order the clusters in $\bar{C}^*$ by decreasing size:
$|C_1^*| \ge |C_2^*| \ge \ldots \ge |C_j^*|$, and let $n_i = |C_i^*|$. As
before, define $d_i = \frac{\alpha w}{60\epsilon |C_i^*|}$. Applying Lemma 3
with $\bar{C}^*$ we can see that for $r \le 3d_1$, a ball of radius $r$ cannot
overlap a good set in $\bar{C}^*$ and any other good set, and a ball
containing points from a good set in $\bar{C}^*$ cannot share any points with
a ball containing points from any other good set. Because $T \le T^*$ we can
also argue that by the time we reach $r = 3d_1$ we must output some cluster.

Given this observation, it is clear that the algorithm can cover at most one
new good set in each cluster that it outputs. In addition, if a new good set
is covered this cluster may not contain points from any other good set. If the
algorithm is able to cluster at least $n - b$ points, it must cover every good
set because the size of each good set is larger than $b$. So it must report
$k$ clusters where each cluster contains points from a distinct good set. □

5 Experimental Results 

We present some preliminary results of testing our
Landmark-Clustering-Min-Sum algorithm on protein sequence data. Instead of
requiring all pairwise similarities between the sequences as input, our
algorithm is able to find accurate clusterings by using only a few BLAST
calls. For each data set we first build a BLAST database containing all the
sequences, and then compare only some of the sequences to the entire database.
To compute the distance between two sequences, we invert the bit score
corresponding to their alignment, and set the distance to infinity if no
significant alignment is found. In practice we find that this distance is
almost always a metric, which is consistent with our theoretical assumptions.
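
The conversion from alignment scores to distances can be sketched as follows.
The function name is hypothetical, and the sketch assumes that "inverting" the
bit score means taking its reciprocal and that a missing hit is represented by
`None`; neither detail is specified in the text.

```python
import math

def bit_score_to_distance(bit_score):
    """Distance as the reciprocal of a bit score (assumed reading of "invert").

    None or a non-positive score stands for "no significant alignment found".
    """
    if bit_score is None or bit_score <= 0:
        return math.inf
    return 1.0 / bit_score

print(bit_score_to_distance(50.0), bit_score_to_distance(None))
```

Higher bit scores map to smaller distances, and unaligned pairs end up
infinitely far apart, so they can never fall inside the same ball.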

In our computational experiments we use data sets created from the Pfam
[FMT+10] (version 24.0, October 2009) and SCOP [MBHC95] (version 1.75,
June 2009) classification databases. Both of these sources classify proteins by
their evolutionary relatedness, so we can use their classifications as
ground truth to evaluate the clusterings produced by our algorithm and other
methods. These are the same data sets that were used in the [VBR+10] study,
so we also show the results of the original Landmark-Clustering algorithm
on these data, and use the same amount of distance information for both
algorithms (30k landmarks/queries for each data set, where k is the number
of clusters). In order to run Landmark-Clustering-Min-Sum we need to set the
parameter T. Because in practice we do not know its correct value, we use
increasing estimates of T until we cluster enough of the points in the data set;
this procedure is similar to the algorithm for the case when we do not know the
optimum objective value OPT and hence do not know T. In order to compare a
computationally derived clustering to the one given by the gold-standard
classification, we use the distance measure from the theoretical part of our work.
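
The comparison of two clusterings can be illustrated with a short sketch. It
assumes that the distance measure is the fraction of points on which two
k-clusterings disagree under the best matching of their cluster indices, and
that both clusterings partition the same n points; the function name is
hypothetical. The brute-force search over matchings is only practical for
small k.

```python
from itertools import permutations


def clustering_distance(clusters_a, clusters_b, n):
    """Fraction of the n points that must be reassigned to turn one
    clustering into the other, minimized over matchings of cluster indices.

    Each clustering is a list of k sets of point ids (assumed to partition
    the same n points).
    """
    k = len(clusters_a)
    if len(clusters_b) != k:
        raise ValueError("both clusterings must have the same number of clusters")
    # overlap[i][j] = number of points shared by clusters_a[i] and clusters_b[j]
    overlap = [[len(a & b) for b in clusters_b] for a in clusters_a]
    # Try every matching of indices and keep the one with the most agreement.
    best_agreement = max(
        sum(overlap[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return (n - best_agreement) / n
```

For larger k the same quantity could be computed with an assignment solver
instead of enumerating all permutations.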



Because our Pfam data sets are so large, we cannot compute the full distance
matrix, so we can only compare with methods that use a limited amount
of distance information. A natural choice is the following algorithm: uniformly
at random choose a set of landmarks L, |L| = d; embed each point in a
d-dimensional space using distances to L; use k-means clustering in this space
(with distances given by the Euclidean norm). This procedure uses exactly d one
versus all distance queries, so we can set d equal to the number of queries used
by the other algorithms. For SCOP data sets we are able to compute the full
distance matrix, so we can compare with a spectral clustering algorithm that
has been shown to work very well on these data [PCS06].
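
The landmark-embedding baseline described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the code used in
the experiments: one_vs_all is a hypothetical stand-in for a one versus all
query, plain Lloyd's iterations stand in for whichever k-means implementation
was used, and all distances are assumed to be finite (infinite distances would
need to be capped first).

```python
import math
import random


def landmark_embedding(points, one_vs_all, d, rng):
    """Embed each point as its vector of distances to d random landmarks.

    one_vs_all(p) is a hypothetical query returning {q: distance(p, q)} for
    every point q; exactly d such queries are issued.
    """
    landmarks = rng.sample(points, d)
    rows = [one_vs_all(landmark) for landmark in landmarks]
    return {p: [row[p] for row in rows] for p in points}


def k_means(vectors, k, rng, iterations=50):
    """Lloyd's k-means under the Euclidean norm; returns {point: cluster index}."""
    ids = list(vectors)
    # Initialize the centers with k distinct data points chosen at random.
    centers = [list(vectors[p]) for p in rng.sample(ids, k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            p: min(range(k), key=lambda c: math.dist(vectors[p], centers[c]))
            for p in ids
        }
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        for c in range(k):
            members = [vectors[p] for p in ids if assignment[p] == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

A typical call would first build the embedding with
landmark_embedding(points, one_vs_all, d, random.Random(seed)) and then pass
the result to k_means with the desired number of clusters.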

From Figure 2 we can see that Landmark-Clustering-Min-Sum outperforms
k-means in the embedded space on all the Pfam data sets. However, it does
not perform better than the original Landmark-Clustering algorithm on most of
these data sets. When we investigate the structure of the ground truth clusters
in these data sets, we see that the diameters of the clusters are roughly the
same. When this is the case the original algorithm will find accurate
clusterings as well [VBR+10]. Still, Landmark-Clustering-Min-Sum tends to give
better results when the original algorithm does not work well (data sets 7 and 9).



[Figure 2 plot: x-axis, data set (1-10); series: K-Means-Embedded-Space, Landmark-Clustering, Landmark-Clustering-Min-Sum]

Fig. 2. Comparing the performance of k-means in the embedded space (blue),
Landmark-Clustering (red), and Landmark-Clustering-Min-Sum (green) on 10 data
sets from Pfam. Data sets 1-10 are created by uniformly at random choosing 8
families from Pfam of size s, 1000 < s < 10000.



Figure 3 shows the results of our computational experiments on the SCOP
data sets. We can see that the three algorithms are comparable in performance
here. These results are encouraging because the spectral clustering algorithm
significantly outperforms other clustering algorithms on these data [PCS06].
Moreover, the spectral algorithm needs the full distance matrix as input and
takes much longer to run. When we examine the structure of the SCOP data
sets, we find that the diameters of the ground truth clusters vary considerably,
which resembles the structure implied by our approximation stability
assumption, assuming that the target clusters vary in size. Still, most of the
time the product of the cluster sizes and their diameters varies, so it does not
quite look like what we assume in the theoretical part of this work.

Fig. 3. Comparing the performance of spectral clustering (blue),
Landmark-Clustering (red), and Landmark-Clustering-Min-Sum (green) on 10 data
sets from SCOP. Data sets A and B are the two main examples from [PCS06]; the
other data sets (1-8) are created by uniformly at random choosing 8
superfamilies from SCOP of size s, 20 < s < 200.



We plan to conduct further studies to find data where clusters have different
scales and there is an inverse relationship between cluster sizes and their
diameters. This may be the case for data that have many outliers, where the
correct clustering groups sets of outliers together rather than assigning them
to arbitrary clusters. The algorithm presented here will consider these sets to
be large-diameter, small-cardinality clusters. More generally, the algorithm
presented here is more robust because it will give an answer no matter what the
structure of the data is like, whereas the original Landmark-Clustering
algorithm often fails to find a clustering if there are no well-defined
clusters in the data.



6 Conclusion 



We present a new algorithm that clusters protein sequences in a limited
information setting. Instead of requiring all pairwise distances between the
sequences as input, we can find an accurate clustering using few BLAST calls.
We show that our algorithm produces accurate clusterings when compared to
gold-standard classifications, and we expect it to work even better on data
whose structure more closely resembles our theoretical assumptions.